Abstract

Privacy has become a major concern in Online Social Networks (OSNs) due to threats such
as advertising spam, online stalking and identity theft. Although many users hide or do not 
fill out their private attributes in OSNs, prior studies point out that the hidden attributes 
may be inferred from other public information. Thus, users’ private information could
still be at risk of exposure. Hitherto, little work has helped users assess the exposure
probability/risk that their hidden attributes can be correctly predicted, let alone provided
them with targeted countermeasures. In this article, we focus our study on the exposure
risk assessment of a particular privacy-sensitive attribute, current city, in Facebook.
Specifically, we first design a novel current city prediction approach that discloses users’ 
hidden ‘current city’ from their self-exposed information. Based on 371,913 Facebook users’
data, we verify that our proposed prediction approach can predict users’ current city more 
accurately than state-of-the-art approaches. Furthermore, we inspect the prediction results 
and model the current city exposure probability via some measurable characteristics of the 
self-exposed information. Finally, we construct an exposure estimator to assess the current 
city exposure risk for individual users, given their self-exposed information. Several case 
studies are presented to illustrate how to use our proposed estimator for privacy protection.

1. Introduction

During the last decade, Online Social Networks (OSNs) have successfully attracted billions
of people who share a huge amount of personal information through the Internet, such
as their background, preferences and social connections. Owing to the increase of potential
violations such as advertising spam, online stalking and identity theft [?], in recent years,
more and more users have concerns about their privacy in OSNs and become reluctant to
publish all their personal information [?]. Consequently, users may not fill out their
privacy-sensitive attributes (e.g., location, age, or phone number), or they hide this information from
strangers and only allow their friends to view such information [?].

While hiding the privacy-sensitive attributes, users usually expose some other information
that appears to be less sensitive to them. It has been reported that Facebook users
publicly reveal four attributes on average, and 63% of them uncover their friends list [13].
Due to the correlations among various attributes, some of the self-exposed information
may indicate the invisible privacy-sensitive attributes to some extent [6], [?]. Hence, it is
questionable whether the privacy-sensitive attributes that a user intends to hide are really
hidden.

This work, using location information as a representative case, aims to assess the
risk that a user’s invisible information could be disclosed. There are several reasons
for basing this study on location information. First, among various
kinds of information, location is usually one of the privacy-sensitive attributes for most
users. In real-life OSNs, we notice that users are quite careful not to reveal their location
information: 16% of users in Twitter reveal home city [23] and 0.6% of Facebook users
publish home address [?]. Second, location information is a commercially valuable attribute
which might even be misused by unscrupulous businesses to bombard a user with unsolicited
marketing [12]. In addition, location information leakage may lead to a spectrum of intrusive
inferences such as inferring a user’s political view or personal preference [12], [?]. Therefore,
protecting the hidden location information for a user becomes rather critical. In particular,
as Facebook is the most popular OSN [?], we concentrate on the attribute of current city
in Facebook and investigate the following issues:

1) Is the private current city that a user expects to hide really hidden? In other words, 
if a user hides his current city but exposes some other information, can we predict a user’s 
current city by using his self-exposed information? 

2) Can we help individual users to understand the actual risk (probability) that their 
private current city could be correctly predicted based on their self-exposed information? 
Furthermore, can we provide some countermeasures to increase the security of the hidden 
current city? 

To address these issues, we first propose a current city prediction approach to predict
users’ hidden current city. Although many location prediction approaches have been developed
for Twitter [?], [32] and Foursquare [29], [?], they cannot be appropriately
implemented on Facebook because of the different properties (e.g., obtainable information)
in these OSNs. For Facebook, Backstrom et al. predict users’ locations based on their
friends’ locations [3]. In addition to friends’ locations, users’ profile attributes, such as
hometown, school and workplace, may also indicate their current city to some extent [?]. In
order to achieve high prediction accuracy in Facebook, we devise a novel current city prediction
approach by extracting location indications from integrated self-exposed information
including profile attributes and friends list.

Second, based on the proposed prediction approach, we construct a current city exposure 
estimator to estimate the exposure probability that a user’s invisible current city may be 
correctly inferred via his self-exposed information. The exposure estimator can also provide 
a user with some countermeasures to keep his hidden current city hidden. To the best of our 
knowledge, this is the first work that estimates the exposure probability of a user’s invisible
attribute by his self-exposed information. 

It is a non-trivial task to construct either the current city prediction approach or the
exposure estimator. We encounter the following challenges:

1) How to extract and integrate different location indications from a user’s
multiple self-exposed information? Since the proposed prediction approach explores
location indications from both profile attributes and friends list, two subproblems are considered.
(i) A user probably reveals multiple attributes (e.g., hometown, workplace) which
may indicate different locations; besides, a certain attribute might indicate several locations.
For example, a user working in Google suggests that the user could probably live in any
city where Google sets up an office, e.g., California, Beijing or Paris. (ii) The friends
of a user, probably residing in different cities, may be close to or far away from the user.
These strong or weak geographic relations may influence the significance of the friends’ location
indications. Thus, it is challenging to appropriately combine these various location
indications into an integrated model, so as to determine the probabilities of locations where
the user may live.

2) How to predict a user’s current city when we obtain the probabilities of 
the user being at various locations? By overcoming challenge 1, we can obtain a 
probability vector which indicates the probabilities that a user resides at certain locations. 
With this probability vector, a straight-forward prediction approach could select the location 
with the highest probability as the user’s current city. However, this might not be the best 
option when concerning the locations’ geographic relations. Assume the probability vector 
suggests that a user u has 40%, 35% and 25% probability of residing in Beijing, Paris 
and Evry respectively. Then, u is more likely to live in the area around Paris and Evry 
than Beijing, because Paris and Evry are only 30 km apart but they are thousands of
kilometers away from Beijing. Hence, a location selection method should be carefully 
designed for a current city prediction approach. 

3) How to estimate the exposure risk of a user’s hidden current city? To help 
a user understand the exposure risk of his hidden current city, a straight-forward method 
would be to provide a predicted location; thus the user can decide whether his current city 
can be predicted correctly (risky) or incorrectly (secure). However, this method may not 
meet users’ expectations. A user whose location is correctly predicted may expect to
know which of his self-exposed information primarily leads to the leakage of his
private current city and how to increase its security. A user whose hidden location is not
predicted correctly still needs to be aware of some leakage of location that may exist. For
example, a prediction approach may incorrectly infer that a Parisian lives in Lyon according
to probabilistic results: 55% in Lyon and 45% in Paris. Even though the prediction result
is incorrect, the user still leaks some location information. Therefore, how to estimate the
current city exposure risk and help a user achieve his desired privacy level is a challenging 
objective. 

This paper makes the following contributions: 

1) Profile and friend location indication model: To properly extract location indications
from users’ self-exposed information, we construct an integrated probability model.
We capture location indications from two types of information: location sensitive attributes
and friends list. Location sensitive attributes are the profile attributes that can indicate
one or multiple locations. In this paper, we use ‘Hometown’ and ‘Work and Education’ as 
the location sensitive attributes. For each location sensitive attribute, we set up a location 
attribute indication matrix from which we can index the locations and the corresponding 
probabilities that a certain attribute value indicates. Besides, considering a user and each of 
his friends who publish current city, we estimate their location similarity according to their 
attribute correlations, and assign a large weight to a friend that has a high location similarity 
to the user. For a friend who does not reveal current city, we predict the friend’s current city 
using his visible location sensitive attributes, and assign him a very small weight. Finally, 
based on information from 371,913 users collected from Facebook, we train an integrated 
model that can determine the probability for each potential city where a user may reside. 

2) Current city prediction approach: To address Challenge 2, we aggregate locations
into clusters by considering the locations’ geographic relations. Then, based on the proposed
profile and friend location indication model, we predict a user’s invisible current city in two
steps: (i) cluster-selection: for each cluster, we sum up the probabilities of locations inside
the cluster; then we select the cluster with the highest probability; (ii) location-selection:
we determine a best location within the selected cluster as the user’s current city. The
evaluation results demonstrate that our proposed prediction approach achieves lower error
distance and higher accuracy than the state-of-the-art approaches. Furthermore, for the
users who reveal their ‘Hometown’ and ‘Work and Education’, our proposed approach can
predict current city with an accuracy of 90%.

3) Current city exposure estimator: We define some measurements to describe the
characteristics of users’ self-exposed information. Based on these measurements, we analyze
how the users’ self-exposed information affects the probability that their current city may be
correctly inferred (i.e., current city exposure probability). Furthermore, the Random Decision
Forest method is employed to model the current city exposure probability, and subsequently
a current city exposure estimator is constructed. Given a user’s self-exposed information,
the proposed exposure estimator provides two measures, Exposure Probability and Risk
Level, to quantify the current city exposure risk. The exposure estimator can also estimate
the exposure risk assuming that the user hides some of his self-exposed information.
Consequently, the user can easily decide which information he should hide to satisfy his
privacy intention.

The rest of this paper is organized as follows. We review the literature in Sec. 2, formulate
the current city prediction problem in Sec. 3, and overview our solution to the prediction
problem in Sec. 4. Next, the profile and friend location indication model is devised in
Sec. 5; the current city prediction approach is presented and evaluated in Sec. 6
and Sec. 7, respectively. By inspecting the current city prediction results, Sec. 8 proposes the
exposure estimator. Finally, Sec. 9 discusses the results and points out future work, and Sec. 10
concludes this work.

2. Literature Review 

In this section, we briefly review the related work from two perspectives: city-level 
location prediction and privacy in OSNs. 

2.1. City-Level Location Prediction 

Existing city-level location prediction approaches can be classified into four categories:
relationship-based prediction, content-based prediction, hybrid content-relationship prediction
and multi-indication prediction.


2.1.1. Relationship-based Prediction 

Based on the principle that the probability of being friends declines with geographic
distance, this prediction category infers a user’s location according to the visible locations
of his friends [3]. Researchers have studied the correlation between geographic distance and
social relationship among large-scale Facebook users in the United States. They reveal that the
probability of being friends falls monotonically as the distance increases. Based
on this observation, they build a maximum-likelihood location prediction model and finally
refine the prediction with an iterative algorithm.

2.1.2. Content-based Prediction 

The rise of Twitter has spawned a mass of tweets. As some tweets contain location-specific
data, this category of prediction approaches [8], [21] infers a user’s location relying
on his location-related tweets. The basic idea of these approaches is to detect the location-related
tweets and construct a probabilistic model to estimate the distribution of location-related
words used in tweets. In order to raise the prediction accuracy, the basic idea is
improved by various means, such as selecting the top K probable cities [5], identifying
words with a strong local geo-scope and refining the prediction with a neighborhood
smoothing model [8].


2.1.3. Hybrid Content-Relationship Prediction 

Another compelling category combines the location indications from relationships and
tweet content. TweetHood identifies a user’s location by exploring both his tweets and his
closest friends’ locations [?]. Tweecalization improves TweetHood by employing a semi-supervised
learning algorithm and introducing a new measurement which combines trustworthiness
and the number of common friends to weight friends [?]. Li et al. integrate the
location influences captured from both social network and user-centric tweets into a unified
discriminative probabilistic model [24]. By considering a user who may be related to multiple
locations, MLP model [23] proposes to set up a complete ‘location profiles’ prediction
which infers not only a user’s home location but also his other related locations.


2.1.4. Multi-Indication Prediction

Besides users’ relationships and content, multi-indication prediction approaches explore
multiple location indications from other possible location resources to infer users’ invisible
location. To resolve ambiguous toponyms in tweet content, besides location indications
extracted from tweets, existing work has introduced location indications from websites’
country code, geocoded IP addresses, time zone and UTC-offset [33]. Such a multi-indication
idea has also been applied to Foursquare, which specifically exploits mayorships, tips
and dones that users marked [30]. However, all these multi-indication prediction approaches
are proposed for either Twitter or Foursquare, but not for Facebook. Our previous work
reveals the statistically analyzed correlation between users’ current city and other location
sensitive attributes in Facebook [6]. It also predicts a user’s current city with city-level
and country-level results by using a neural network approach. However, this previous work
assumed that an attribute value could map to a specific location, which is not true in many
cases. Recall the example that a user who works at Google might work in California, Paris
or Beijing (Challenge 1 of Sec. 1).

In this paper, we consider multiple location indications by integrating relationship and
profile attributes in Facebook. Compared to our previous work, where an attribute value
could only be bound to one fixed location [6], an attribute value can be mapped to multiple
locations with different probabilities in the newly proposed model. In addition, we consider
friends whose current city is either visible or invisible, whereas the existing work
relies on the friends who reveal their locations [3], [24]. Particularly, we propose a new
approach to bias the weights of friends whose current city is visible.

2.2. Privacy in OSNs 

In OSNs, users are more and more concerned with privacy of their personal information [?].
A majority of users configure their privacy settings and hide some of their
information from strangers. Unfortunately, previous research has pointed out the disparity
between the expectation and the reality of users’ privacy; and it has also shown that much
of users’ private information is easily uncovered [27].

Much existing work ascribes the privacy leakage to the users themselves. On one hand,
users might incorrectly manage their privacy settings due to the poor human-computer
interaction or complex privacy maintainability [27]. To address this issue, researchers have
designed a user-friendly interface for managing privacy settings with an audience view [26].
On the other hand, users only hide some of the attributes that are privacy-sensitive to
them while making the others accessible to the public: users on Facebook generally expose more
than four attributes to strangers and 63% of users share their friend lists with the public [13].
As reported, such user self-exposure behavior leaves a huge chance for inferring the hidden
attributes [11], [28], [?]. Many tools have been developed to infer users’ invisible information
by various means such as inferring the private information through users’ other self-exposed
information, their social connections [?], [20] and social groups [40], [22].
Some papers claim that it is hard for a user to avoid privacy leakages if he only hides
the private attribute [11], [28], [36]; whereas many studies merely give users the general
idea of hiding other attributes so as to become more secure (e.g., hide relationships [19]).
Unlike the above work, we provide an individual user with the exposure probability of his
private current city concerning his self-exposed information. We also suggest some targeted
rules for protecting users’ privacy on their current city.


3. Formulation of Current City Prediction Problem 

In this section, we formulate the current city prediction problem. Facebook, as a social
network containing location information, can be viewed as an undirected graph
$\mathcal{G} = (\mathcal{U}, \mathcal{E}, \mathcal{L})$, where $\mathcal{U}$ is a set of users; $\mathcal{E}$ is a set of edges $e(u, v)$
representing the friend relationship between users $u$ and $v$, where $u, v \in \mathcal{U}$;
$\mathcal{L}$ is a candidate locations list composed of all the user-generated locations.

Typically, a user $u$ in Facebook might contribute various items of information, e.g., basic
profile information, friends, comments and photos. The core information of $u$ in this paper
is the user’s current city, denoted as $l(u)$. The users are classified into two sets according to
the accessibility of users’ current city: current city available users (LA-users) and current
city unavailable users (LN-users). We use $\mathcal{U}^{LA}$ and $\mathcal{U}^{LN}$ to denote the sets of
LA-users and LN-users, respectively, where $\mathcal{U} = \mathcal{U}^{LA} \cup \mathcal{U}^{LN}$.

To predict users’ current city, we exploit the users’ location sensitive attributes and
friends list. Assume that there exist $m$ types of location sensitive attributes, denoted as
$\mathcal{A} = \{a_1, a_2, \cdots, a_m\}$. Specifically, we denote a user $u$’s location sensitive attributes as
$\mathcal{A}(u) = \{a_1(u), a_2(u), \cdots, a_m(u)\}$. The users may also have a friends list, denoted as $\mathcal{F}(u)$,
where $\mathcal{F}(u) = \{f \in \mathcal{U} : e(u, f) \in \mathcal{E}\}$. Therefore, we use a tuple to represent a user as
$u : (l(u), \mathcal{A}(u), \mathcal{F}(u))$.

Additionally, each location is associated with a unified ID ($l_{id}$). With this ID,
we can obtain each location’s latitude and longitude coordinates via the Facebook Graph API
Explorer. Therefore, a location can also be written as a tuple $l : (l_{id}, lat, lon)$, and the
candidate locations list can be denoted as a set of location tuples $\mathcal{L} = \{l : (l_{id}, lat, lon)\}_N$,
where $lat$ and $lon$ respectively stand for the latitude and longitude of a location, and $N$ is
the number of candidate locations in the list.
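The user and location tuples above can be pictured as plain records. The following Python sketch is only illustrative: the class names, field names and the `split_users` helper are assumptions introduced here, not part of the original formulation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Location:
    """A candidate location l : (lid, lat, lon)."""
    lid: str
    lat: float
    lon: float


@dataclass
class User:
    """A user u : (l(u), A(u), F(u)). Field names are hypothetical."""
    uid: str
    current_city: Optional[Location] = None          # l(u); None when hidden
    attributes: dict = field(default_factory=dict)   # A(u): attribute name -> value or None
    friends: set = field(default_factory=set)        # F(u): ids of friends


def split_users(users):
    """Split users into LA-users (city visible) and LN-users (city hidden)."""
    la_users = [u for u in users if u.current_city is not None]
    ln_users = [u for u in users if u.current_city is None]
    return la_users, ln_users


paris = Location("paris", 48.9, 2.4)
users = [
    User("u1", current_city=paris, attributes={"hometown": "Paris"}),
    User("u2", attributes={"work": "Telecom SudParis"}, friends={"u1"}),
]
la_users, ln_users = split_users(users)
```

Representing a hidden current city as `None` keeps the LA/LN split a one-line filter, which is the only property the later models rely on.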

Thus, the current city prediction problem can be formally stated as: Given (i) a
graph $\mathcal{G} = (\mathcal{U}, \mathcal{E}, \mathcal{L})$; (ii) the public location $l(u)$ for LA-users $u \in \mathcal{U}^{LA}$; (iii) the
location sensitive attributes $\mathcal{A}(u)$ and the friends list $\mathcal{F}(u)$ for all the users $u \in \mathcal{U}$,
we predict current city $l(u)$ for each LN-user $u \in \mathcal{U}^{LN}$, so as to make $l(u)$ close to the user’s
real current city.

Note that the current city of a user’s friends can be either available ($f \in \mathcal{U}^{LA}$) or
unavailable ($f \in \mathcal{U}^{LN}$). Thus, we introduce two notations to represent the two groups
of friends: current city available friends (LA-friends) and current city unavailable friends
(LN-friends). We denote a user’s LA-friends as $\mathcal{F}^{LA}(u)$ and LN-friends as $\mathcal{F}^{LN}(u)$, where
$\mathcal{F}(u) = \mathcal{F}^{LA}(u) \cup \mathcal{F}^{LN}(u)$.

4. Overview of Current City Prediction 

The goal of current city prediction is to correctly infer a coordinate point with latitude
and longitude for an LN-user, given the candidate locations list $\mathcal{L}$ and the user’s self-exposed
information including his location sensitive attributes and friends list. Figure 1 illustrates
the framework of the proposed current city prediction solution. To determine the current
city of an LN-user, we first train an integrated profile and friend location indication (i.e.,
PFLI) model to compute the probabilities of the candidate locations in which the LN-user
may currently live. Next we take a two-step location selection strategy: cluster selection
and location selection. Specifically, we aggregate the nearby locations into a location cluster
and obtain a set of location clusters. We then calculate the probability of a user being in
a cluster by summing up the probabilities of all the candidate locations belonging to this
cluster; the cluster with the highest probability is picked out as a candidate cluster. Finally,
we try to select the ‘best’ location from the candidate cluster as the predicted current city.
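The two-step selection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's method: the clustering rule (single-linkage grouping within a hypothetical `radius_km` of 50), the choice of the highest-probability member as the ‘best’ location, and the approximate coordinates are all introduced here for the example. The probability vector reuses the Beijing/Paris/Evry case from Challenge 2.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cluster_locations(coords, radius_km=50.0):
    """Assumed rule: locations linked by a chain of hops <= radius_km share a cluster."""
    clusters = []
    for name in coords:
        linked = [c for c in clusters
                  if any(haversine_km(coords[name], coords[o]) <= radius_km for o in c)]
        merged = {name}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters


def predict_city(probs, coords, radius_km=50.0):
    """Step 1: pick the cluster with the largest summed probability.
    Step 2 (assumption): return its most probable member."""
    clusters = cluster_locations(coords, radius_km)
    best = max(clusters, key=lambda c: sum(probs.get(n, 0.0) for n in c))
    return max(best, key=lambda n: probs.get(n, 0.0))


# Approximate coordinates, for illustration only.
coords = {"Beijing": (39.9, 116.4), "Paris": (48.9, 2.4), "Evry": (48.6, 2.5)}
probs = {"Beijing": 0.40, "Paris": 0.35, "Evry": 0.25}
predicted = predict_city(probs, coords)
```

With these numbers a plain arg-max would return Beijing (0.40), whereas the Paris/Evry cluster carries 0.60 of the probability mass, so the sketch returns Paris.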


Figure 1: Framework of Current City Prediction.

To train the integrated PFLI model (see the right-hand part of Figure 1), we separately
consider the location indications from location sensitive attributes and friends, and consequently
obtain two sub-models: the profile location indication (PLI) model and the friend location
indication (FLI) model. Both the PLI model and the FLI model calculate a probability vector in
which each element stands for the probability of a user being at a certain candidate location.
Note that the FLI model leverages the location indications from both LA-friends and LN-friends.
By integrating the probability vectors that are generated by the PLI and FLI models
with appropriate parameters, a unified profile and friend location indication (PFLI) model
is derived.

Next, we will elaborate the PFLI model and the current city prediction approach.

5. Profile and Friend Location Indication Model 

In this section, we describe the design of the probabilistic models that can suggest the
probabilities of users being at each of the candidate locations. We first introduce the profile
location indication (PLI) model; it estimates the probability of each candidate location by
merely relying on a user’s location sensitive attributes. Then, we describe the friend location
indication (FLI) model, which captures the location indications from a user’s friends.
Finally, we integrate these two models and obtain the integrated profile and friend location
indication (PFLI) model.

5.1. Profile Location Indication Model 

According to Challenge 1 in Sec. 1, two problems should be considered in constructing
PLI model. First, a certain value of a location sensitive attribute may indicate several
locations. For instance, Google, being a certain value of workplace, could indicate any
city where Google sets up an office such as California, Beijing or Paris. Therefore,
for each attribute value, we consider all possible location indications with the corresponding
probabilities. Second, a user may present multiple location sensitive attributes (e.g., hometown,
workplace, college). Thus we integrate various location indications extracted from
different location sensitive attributes.

To capture the multiple possible location indications from one attribute value, we define
a location-attribute indication matrix for each ($k$-th) location sensitive attribute $a_k \in \mathcal{A}$,
denoted as $\mathcal{R}_k$. The rows of this matrix represent the candidate locations ($l \in \mathcal{L}$), while
the columns stand for the possible values of $a_k$. We use $l_i$ to represent the $i$-th candidate
location and $a_{kj}$ to denote the $j$-th possible value of $a_k$. A cell in the matrix gives
the indication probability of $a_{kj}$ to $l_i$: the probability that a user, whose $k$-th location
sensitive attribute $a_k$ equals $a_{kj}$, currently lives in the city $l_i$. Specifically, the indication
probability equals the number of users who live in $l_i$ and have a value of $a_{kj}$ divided by the
total number of users who have a value of $a_{kj}$. For instance, considering workplace, if 10 out
of 100 employees from Telecom SudParis in the whole data set state that they live in
Evry, then the indication probability of Telecom SudParis to Evry is 0.1. Note that
the $j$-th column of $\mathcal{R}_k$ represents the multiple location indications of $a_{kj}$.

Assume that $a_k$ has $M$ possible values except null; $N$ is the total number of the candidate
locations. The $k$-th location-attribute indication matrix can be written as:

$$\mathcal{R}_k = \{r^k_{ij}\}_{N \times M} = \{p(l(u) = l_i \mid a_k(u) = a_{kj})\}_{N \times M} = [R_{k1}, R_{k2}, \cdots, R_{kM}]$$

where $R_{kj}$ represents all the locations’ probabilities for a user who presents $a_{kj}$.
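One possible way to fill such an indication matrix from raw counts is sketched below in Python. The function name `indication_matrix` and the nested-dictionary layout are assumptions made for illustration, and the sketch counts only users whose current city is visible (so each column sums to 1), which is one reading of the definition. The 90 remaining employees placed in Paris are hypothetical filler for the Telecom SudParis example.

```python
from collections import Counter, defaultdict


def indication_matrix(records):
    """records: list of (attribute_value, current_city) pairs for ONE attribute,
    taken from users whose current city is visible (an assumption of this sketch).
    Returns {attribute_value: {city: indication probability}}."""
    pair_counts = Counter(records)
    value_counts = Counter(value for value, _ in records)
    matrix = defaultdict(dict)
    for (value, city), n in pair_counts.items():
        # share of users holding this value who live in this city
        matrix[value][city] = n / value_counts[value]
    return dict(matrix)


# 10 of 100 Telecom SudParis employees live in Evry; the rest are placed in Paris
# purely for the sake of the example.
records = ([("Telecom SudParis", "Evry")] * 10
           + [("Telecom SudParis", "Paris")] * 90)
workplace_matrix = indication_matrix(records)
```

Here `workplace_matrix["Telecom SudParis"]` plays the role of one column of the matrix, listing every city the value indicates together with its probability.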

Based on the location-attribute indication matrix ($\mathcal{R}_k$), we model the probability of a
user’s current city at $l_i$ by combining all of a user’s available location sensitive attributes in
his profile:

$$p_{prof}(u, l_i) = \sum_{a_k \in \mathcal{A},\ a_k(u) \neq null} \alpha_k \, p(l(u) = l_i \mid a_k(u) = a_{kj}) \qquad (1)$$

where $p(l(u) = l_i \mid a_k(u) = a_{kj})$ can be easily obtained by indexing the corresponding
location-attribute indication matrix ($\mathcal{R}_k$) according to $u$’s value of $a_k$ ($a_k(u) = a_{kj}$) and the
given location ($l_i$), and $\alpha_k$ is a parameter to adjust the significance of the different location sensitive
attributes.

As we discussed in Sec. 3, a user may not reveal some attributes. Therefore, in Eq. (1),
the location indication from the attribute $a_k(u)$ at any location equals zero if the user’s
$a_k(u)$ is invisible. If all of a user’s location sensitive attributes are invisible, we rely on his
friends’ information to infer his current city, which we will discuss in the next section.
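The weighted combination of attribute indications, with hidden attributes contributing nothing, might look as follows in Python. This is a sketch only: the function name, the dictionary layout, the weights of 0.5 and the probabilities in the toy matrices are all hypothetical values chosen for the example, not trained parameters from the paper.

```python
def profile_probability(user_attrs, matrices, alpha):
    """Weighted sum over visible attributes, in the spirit of Eq. (1).
    user_attrs: {attribute: value or None}
    matrices:   {attribute: {value: {city: indication probability}}}
    alpha:      {attribute: weight}"""
    scores = {}
    for attr, value in user_attrs.items():
        if value is None:
            continue  # an invisible attribute contributes zero at every location
        for city, p in matrices.get(attr, {}).get(value, {}).items():
            scores[city] = scores.get(city, 0.0) + alpha[attr] * p
    return scores


# Toy inputs (all numbers are made up for illustration).
matrices = {
    "work": {"Telecom SudParis": {"Evry": 0.1, "Paris": 0.9}},
    "hometown": {"Paris": {"Paris": 0.8, "Lyon": 0.2}},
}
alpha = {"work": 0.5, "hometown": 0.5}

full = profile_probability({"work": "Telecom SudParis", "hometown": "Paris"}, matrices, alpha)
partial = profile_probability({"work": "Telecom SudParis", "hometown": None}, matrices, alpha)
```

In the toy case, revealing both attributes gives Paris a score of 0.85, while hiding the hometown lowers it to 0.45, which shows how each visible attribute adds to the location indication.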


5.2. Friend Location Indication Model 

In addition to a user’s location sensitive attributes, we explore location indications from
users’ friends to construct the FLI model. A user’s friends can be either LA-friends (current
city available) or LN-friends (current city unavailable). We build the FLI model primarily
on LA-friends’ location indications, with LN-friends’ location indications
acting as a small regulator. Accordingly, the FLI model contains two components: the LA-friends
location indication (LA-FLI) model and the LN-friends location indication (LN-FLI) model.

5.2.1. LA-FLI Model 

LA-FLI model differentiates the weights of a user’s LA-friends and estimates his probability
of living in location $l_i$ by the weights of his friends who also live in $l_i$. LA-FLI model
expects to assign high weights to the LA-friends who live in the same city as the user does.
However, since the user’s city is unknown, whether or not a friend and the user live in the
same city cannot be directly determined. Therefore, LA-FLI model assesses the likelihood
that two users live in the same city (i.e., location similarity) based on the correlation between
their location sensitive attributes. Figure 2 illustrates an example to show that the location
sensitive attributes can be used to distinguish the weights among various LA-friends. Focusing
on LN-user $u_2$ and his LA-friends $u_3$, $u_4$ and $u_5$, we notice that $u_2$, $u_3$ and $u_4$ work in the
same institute, while $u_5$ works in another company which is far away from $u_2$’s workplace.
In this case, it is natural to infer that $u_2$ is more likely to be living in the same city as
$u_3$ and $u_4$ than as $u_5$; then $u_3$ and $u_4$ should be assigned higher weights than $u_5$ because
of the location similarity indicated by their workplace.

Figure 2: An Example of Social Relations and Profile Information.

Inspired by the example, we construct an attribute-based location similarity matrix ($\mathcal{W}_k$)
for each ($k$-th) location sensitive attribute ($a_k \in \mathcal{A}$). In the matrix, a cell $w^k_{ij}$ gives
the probability that two users live in the same city (i.e., location similarity) when they
respectively have values of $a_{ki}$ and $a_{kj}$ regarding $a_k$. Specifically, we compute the total
number of friend pairs where one user has a value of $a_{ki}$ and the other has a value of $a_{kj}$,
denoted as $|\{a_k(u) = a_{ki} \wedge a_k(v) = a_{kj}\}|$. Among these friend pairs, we further count the pairs
of friends who live in the same city, denoted as $|\{l(u) = l(v) \wedge a_k(u) = a_{ki} \wedge a_k(v) = a_{kj}\}|$.
Then,

$$\mathcal{W}_k = \{w^k_{ij}\}_{M \times M} = \{p(l(u) = l(v) \mid a_k(u) = a_{ki} \wedge a_k(v) = a_{kj})\}_{M \times M} = \left\{ \frac{|\{l(u) = l(v) \wedge a_k(u) = a_{ki} \wedge a_k(v) = a_{kj}\}|}{|\{a_k(u) = a_{ki} \wedge a_k(v) = a_{kj}\}|} \right\}_{M \times M}$$

where $M$ is the number of possible values of attribute $a_k$ including null.

For a certain attribute $a_k$, assume that $u$ and his LA-friend $v$ have values of $a_{ki}$ and
$a_{kj}$ respectively. Then, the location similarity of $u$ and $v$ on $a_k$ can be easily obtained by
indexing the $i$-th row and $j$-th column of $\mathcal{W}_k$, denoted as $\mathcal{W}_k(u, v) = w^k_{ij}$, $v \in \mathcal{F}^{LA}(u)$.
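The pair counting behind the location similarity matrix can be illustrated with a short Python sketch. The function name, the input layout and the abbreviation "TSP" (for Telecom SudParis) are assumptions made here; the sketch also assumes that only friend pairs with both cities visible are counted and that each undirected pair is recorded in both orderings so the result is symmetric.

```python
from collections import Counter


def similarity_matrix(friend_pairs):
    """friend_pairs: list of ((value_u, city_u), (value_v, city_v)) for ONE attribute,
    restricted to friend pairs whose cities are both visible (assumption).
    Returns {(value_i, value_j): share of such pairs living in the same city}."""
    total, same = Counter(), Counter()
    for (val_u, city_u), (val_v, city_v) in friend_pairs:
        # Friendship is undirected, so count the pair under both orderings;
        # the set avoids double counting when both values are equal.
        for key in {(val_u, val_v), (val_v, val_u)}:
            total[key] += 1
            same[key] += int(city_u == city_v)
    return {key: same[key] / n for key, n in total.items()}


# Toy friend pairs for the workplace attribute (made-up data).
pairs = [
    (("TSP", "Paris"), ("TSP", "Evry")),
    (("TSP", "Paris"), ("TSP", "Paris")),
    (("TSP", "Paris"), ("Baidu", "Beijing")),
]
work_similarity = similarity_matrix(pairs)
```

In this toy data, half of the colleague pairs share a city, so the (TSP, TSP) cell is 0.5, while the (TSP, Baidu) cell is 0.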

We combine multiple location similarities on all the location sensitive attributes (e.g., 
work, hometown) with a set of trained parameters {/3) to measure u’s weight. This combined 
weight describes the probability that u and v live in the same city concerning all of their 
location sensitive attributes. 

Then, LA-FLI model calculates the probability of $u$ living in $l_i$ by integrating all the weights of $u$’s LA-friends who live in $l_i$:

$$p_{LA-F}(u, l_i) = \sum_{v \in \mathcal{F}^{LA}(u)} \sum_{a_k \in \mathcal{A}} \beta_k W_k(u, v)\, p_{LA-U}(v, l_i) \tag{2}$$

where $p_{LA-U}(v, l_i)$ represents whether or not the LA-friend $v$ lives in $l_i$. It equals 1 if $v$ states his current city is $l_i$; otherwise, it is 0:

$$p_{LA-U}(v, l_i) = \begin{cases} 1 & \text{if } l(v) = l_i \\ 0 & \text{otherwise} \end{cases}$$
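A minimal Python sketch of the LA-FLI score follows. It assumes the similarity matrices are stored as dictionaries keyed by value pairs and that the parameters $\beta_k$ are already trained; the name `la_fli` and the data layout are hypothetical.

```python
def la_fli(u, la_friends, attrs, city, W, beta):
    """Score each city by the beta-weighted similarities of u's LA-friends living there.

    attrs: user id -> {attribute name: value}
    city:  LA-friend id -> stated current city
    W:     attribute name -> {(value_u, value_v): similarity}
    beta:  attribute name -> trained weight (assumed given)
    """
    scores = {}
    for v in la_friends:
        weight = 0.0
        for k, W_k in W.items():
            pair = (attrs[u].get(k), attrs[v].get(k))
            weight += beta[k] * W_k.get(pair, 0.0)
        # The indicator is 1 only for v's stated city, so v contributes to that city alone.
        scores[city[v]] = scores.get(city[v], 0.0) + weight
    return scores

attrs = {"u2": {"work": "TSP"}, "u3": {"work": "TSP"},
         "u4": {"work": "TSP"}, "u5": {"work": "Baidu"}}
city = {"u3": "Paris", "u4": "Evry", "u5": "Beijing"}
W = {"work": {("TSP", "TSP"): 0.5, ("TSP", "Baidu"): 0.1}}
la_scores = la_fli("u2", ["u3", "u4", "u5"], attrs, city, W, {"work": 1.0})
```

In this toy input the two colleagues outweigh the friend who works elsewhere, which mirrors the workplace example of Figure 2.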

5.2.2. LN-FLI Model 

Before introducing LN-FLI model, we inspect the potential benefit of a user’s LN-friends for his current city prediction with another example shown in Figure 2. We observe that $u_2$, being a LN-friend of $u_1$, does not expose his current city; whereas, the workplace of $u_2$, Telecom SudParis, indicates two cities (Paris and Evry) according to the current cities of the users $u_3$ and $u_4$ who are also employees of Telecom SudParis. Thereby, a user’s LN-friends can also reveal some location indications in their exposed attributes, which may help the prediction.

Therefore, for a LN-friend $v$, we first rely on his exposed location sensitive attributes and use PLI model (Sec. 5.1) to predict his current city, as:

$$p_{prof}(v, l_i) = \sum_{a_k \in \mathcal{A},\, a_k(v) \neq null} \alpha_k\, p(l(v) = l_i \mid a_k(v))$$

Treating all the LN-friends equally, LN-FLI model integrates LN-friends’ location indications and computes the probability that $u$ lives in $l_i \in \mathcal{L}$ as:

$$p_{LN-F}(u, l_i) = \sum_{v \in \mathcal{F}^{LN}(u)} p_{prof}(v, l_i) \tag{3}$$

5.2.3. FLI Model

Finally, primarily relying on LA-FLI model and being adjusted by LN-FLI model with a small regulator parameter $\lambda$, FLI model estimates the probability that $u$ currently lives in $l_i$ as:

$$p_F(u, l_i) = p_{LA-F}(u, l_i) + \lambda\, p_{LN-F}(u, l_i) \tag{4}$$
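The way the two friend-based indications are combined can be sketched as below. This is an illustrative sketch only: the function names, the per-friend dictionaries of profile-based scores and the value of the regulator parameter are assumptions.

```python
def ln_fli(ln_friends, profile_scores):
    """Add up the profile-based location indications of all LN-friends (treated equally)."""
    total = {}
    for v in ln_friends:
        for loc, p in profile_scores[v].items():
            total[loc] = total.get(loc, 0.0) + p
    return total

def fli(la_scores, ln_scores, lam):
    """Combine the LA-friend score with the LN-friend score scaled by a small regulator lam."""
    locations = set(la_scores) | set(ln_scores)
    return {loc: la_scores.get(loc, 0.0) + lam * ln_scores.get(loc, 0.0)
            for loc in locations}

# Hypothetical input: the LN-friend's workplace points to Paris and Evry equally.
ln_scores = ln_fli(["u2"], {"u2": {"Paris": 0.5, "Evry": 0.5}})
combined = fli({"Paris": 1.0}, ln_scores, lam=0.5)
```

Keeping `lam` small lets the LN-friends adjust the ranking without overriding the stated cities of LA-friends.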

5.3. Integrated Profile and Friend Location Indication Model 

Next, we discuss how to integrate PLI model and FLI model into a unified probabilistic location indication model, so as to capture the complete location indications. Specifically, PFLI model calculates the probability of $u$ living in $l_i \in \mathcal{L}$ as:

$$p(u, l_i) = \theta_P\, p_{prof}(u, l_i) + \theta_F\, p_F(u, l_i) \tag{5}$$

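Parameter Computation: To obtain a set of good parameters for the model, we first rewrite the model as:

$$
\begin{aligned}
p(u, l_i) &= \sum_{a_k \in \mathcal{A}} \theta_P \alpha_k\, \sigma_k(u, l_i) \\
&\quad + \sum_{a_k \in \mathcal{A}} \sum_{v \in \mathcal{F}^{LA}(u)} \theta_F \beta_k\, W_k(u, v)\, p_{LA-U}(v, l_i) \\
&\quad + \sum_{a_k \in \mathcal{A}} \sum_{v \in \mathcal{F}^{LN}(u)} \lambda \theta_F \alpha_k\, \sigma_k(v, l_i) \\
&= \sum_{a_k \in \mathcal{A}} [\mu_k \sigma_k(u, l_i) + \nu_k \delta_k(u, l_i)] + \sum_{a_k \in \mathcal{A}} \omega_k \gamma_k(u, l_i)
\end{aligned}
\tag{6}
$$

where $\sigma_k(u, l_i)$ denotes the profile-based indication of attribute $a_k$ used by PLI model, i.e., $p(l(u) = l_i \mid a_k(u))$, and

• $\mu_k = \theta_P \alpha_k$, $\nu_k = \theta_F \beta_k$, $\omega_k = \lambda \theta_F \alpha_k$

• $\delta_k(u, l_i) = \sum_{v \in \mathcal{F}^{LA}(u)} W_k(u, v)\, p_{LA-U}(v, l_i)$

• $\gamma_k(u, l_i) = \sum_{v \in \mathcal{F}^{LN}(u)} \sigma_k(v, l_i)$

The location indications extracted from a user’s location sensitive attributes and his LA-friends are considered as primary indications, while the location indication captured from the LN-friends is only used to regulate the results. Therefore, we integrally train a good set of parameters $\mu_k$ and $\nu_k$, while we separately train $\omega_k$.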

To train the parameters $\mu_k$ and $\nu_k$, we generate a training data set with items (label($l_i$) : features($u, l_i$)), if the probability that a LA-user $u$ lives in $l_i$ is larger than zero, i.e., $\sum_{a_k \in \mathcal{A}} [\sigma_k(u, l_i) + \delta_k(u, l_i)] > 0$. In particular, $l_i$ is labeled as a far location (label($l_i$) = 0) if the distance between $l_i$ and $u$’s actual location is larger than a pre-defined threshold; otherwise, it is labeled as a close location (label($l_i$) = 1). Additionally, features($u, l_i$) is a vector consisting of $\sigma_k(u, l_i)$ and $\delta_k(u, l_i)$, where $k \in [1, m]$ represents the $k$-th location sensitive attribute. Based on the generated items, we use a logistic regression method to train the model in the following format:

$$f(y \mid x; \mu_1, \cdots, \mu_m, \nu_1, \cdots, \nu_m) = h_{\mu,\nu}(x)^{y} (1 - h_{\mu,\nu}(x))^{1-y}$$

where $y$ is the label($l_i$), $x$ stands for the features($u, l_i$) and $h_{\mu,\nu}(x)$ is the hypothesis function. Then we can apply a gradient-based method to maximize $f(y \mid x; \mu, \nu)$ and compute the parameters. In a similar way, we can train a set of parameters $\omega_k$.
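The training step can be illustrated with a small, self-contained logistic regression fitted by gradient ascent on the log-likelihood. This is a sketch under stated assumptions rather than the authors' training code: the hypothesis function is taken to be the standard sigmoid without an intercept, and the learning rate, epoch count and toy items are invented for illustration.

```python
import math

def train_logistic(items, lr=0.5, epochs=200):
    """Fit one weight per feature by gradient ascent on the logistic log-likelihood.

    items: list of (label, feature_vector) with label 1 (close) or 0 (far)
    """
    n_features = len(items[0][1])
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for y, x in items:
            z = sum(wi * xi for wi, xi in zip(w, x))
            h = 1.0 / (1.0 + math.exp(-z))      # sigmoid hypothesis (assumed form)
            for j, xj in enumerate(x):
                grad[j] += (y - h) * xj          # derivative of the log-likelihood
        w = [wi + lr * g / len(items) for wi, g in zip(w, grad)]
    return w

# Toy items: feature 0 is informative for the label, feature 1 is not.
items = [(1, [1.0, 0.0]), (1, [1.0, 1.0]), (0, [0.0, 1.0]), (0, [0.0, 0.0])]
weights = train_logistic(items)
```

The learned weights play the role of the per-attribute parameters; a feature that does not help separate close from far locations ends up with a lower weight.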


6. Current City Prediction Approach 

To address Challenge 2, we aggregate the close candidate locations into clusters and devise a two-step current city selection approach. In this section, we elaborate the Candidate Locations Cluster, Cluster Selector and Location Selector respectively. We summarize the prediction approach at the end of this section.


6.1. Candidate Locations Cluster 

We draw on the hierarchical clustering method, i.e., UPGMA (Unweighted Pair Group Method with Arithmetic Mean), to generate location clusters. This method arranges all the candidate locations in a hierarchy with a treelike structure based on the distance between two locations, and successively merges the closest locations into clusters. Algorithm 1 elaborates the clustering process.

Figure 3 illustrates an example of the clustering results on 154 candidate locations that are located in the area with latitude in 47°N ~ 49°N and longitude in 1°W ~ 6°E. By using the hierarchical clustering method, we divide these locations into 5 clusters. We note several properties of our location clusters. First, instead of dividing areas with equal-sized grid cells [25, 9], the hierarchical clustering method only considers the user-generated locations while the areas that no user mentions are out of consideration. Second, the densities inside the clusters are different; however, the average distances between all the candidate locations in any two neighboring clusters are equal (100 km in Figure 3). Third, the complexity of the
algorithm is 0(|£|^), where |/1| is the total number of the candidate locations. 
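The merging procedure of this subsection can be sketched in plain Python as follows. This is a naive illustration, not the authors' code: locations are assumed to be (latitude, longitude) tuples, the haversine formula is an assumed choice of distance, and merging simply stops once the closest pair of clusters is farther apart than the threshold, which for average linkage should yield the same clusters as cutting the full tree at that threshold.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def average_linkage_clusters(points, threshold_km):
    """Repeatedly merge the two clusters with the smallest average pairwise distance."""
    clusters = [[p] for p in points]

    def avg_dist(a, b):
        return sum(haversine_km(p, q) for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [(avg_dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        if d > threshold_km:          # remaining clusters are too far apart to merge
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters

paris, evry, beijing = (48.86, 2.35), (48.63, 2.44), (39.9, 116.4)
clusters = average_linkage_clusters([paris, evry, beijing], threshold_km=100.0)
```

For larger location sets, a library routine such as SciPy's hierarchical clustering would be the practical choice; the loop above only makes the steps explicit.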


6.2. Cluster Selector 

Given a location cluster and a LN-user’s location probability vector obtained by PFLI model, we sum up the user’s probabilities of locations inside the cluster as the cluster probability. Cluster selector calculates the probabilities of all the clusters that the LN-user may reside in and then selects the cluster with the highest probability.

ALGORITHM 1: Clustering Locations
Input: All the candidate locations $l \in \mathcal{L}$;
Output: Location clusters set $C = \{c_1, c_2, \cdots, c_s\}$ ($s$ is the number of clusters);
Step 1: treat each $l \in \mathcal{L}$ as a cluster and calculate the distance between any two locations;
repeat
    Step 2: find and merge the two closest location clusters into a new location cluster;
    Step 3: compute the average distance between the new cluster and each of the old ones;
until all the candidate locations are organized into one cluster tree;
Step 4: cut the cluster tree into clusters with an ideal distance threshold

Figure 3: Example of Candidate Locations Cluster.
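The cluster selector amounts to a sum followed by an arg-max, as in this short sketch. The function name and the toy probabilities are illustrative assumptions.

```python
def select_cluster(clusters, prob):
    """Return the cluster whose member locations carry the largest total probability."""
    return max(clusters, key=lambda c: sum(prob.get(loc, 0.0) for loc in c))

clusters = [["Paris", "Evry"], ["Beijing"]]
prob = {"Paris": 0.3, "Evry": 0.3, "Beijing": 0.4}
selected = select_cluster(clusters, prob)
```

In this toy input the single most probable location is Beijing, yet the Paris and Evry cluster wins because its members jointly carry more probability, which is the situation the two-step selection is meant to handle.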

6.3. Location Selector 

Finally, we select a best point from the selected cluster as the user’s predicted location of the current city. Three alternatives are considered. First, we select the point of the highest probability inside the selected cluster as the best point. Second, we consider the geographic centroid of the selected cluster as the user’s best point. The geographic centroid is the average coordinate for all the points in a cluster while the probability of each point is considered as its weight. Third, we calculate the center of minimum distance, which has the minimum overall distance from itself to all the rest of the locations in a cluster. We will further discuss and compare the three methods in Sec. 7.

ALGORITHM 2: Current City Prediction
Input: A LN-user $u$’s location sensitive attributes;
       $u$’s friends list and friends’ location sensitive attributes;
       Location clusters set $C = \{c_1, c_2, \cdots, c_s\}$ ($s$ is the number of clusters);
Output: Predicted current city for $u$: $(lat, lon)$;
Compute location indications $p(u)$ by $u$’s location sensitive attributes and LN-friends (Eq. 7);
Obtain all of LA-friends’ current cities $\mathcal{L}_{LA-F}$;
for $l_i \in \mathcal{L}_{LA-F}$ do
    $p(u, l_i) \leftarrow p(u, l_i) + p_{LA-F}(u, l_i)$;
end
for $c_x \in C$ do
    $p(u)_{c_x} = \sum_{l_i \in c_x} p(u, l_i)$;
end
Cluster selection: $c_h$, where $p(u)_{c_h} \geq p(u)_{c_x}, \forall c_x \in C$;
Location selection from $c_h$ (Sec. 6.3);
The predicted current city of $u$: $(lat, lon)$
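The three location selectors of Sec. 6.3 can be sketched as follows. Several points are assumptions made for illustration: locations are (lat, lon) tuples, the weighted centroid averages coordinates directly (a planar approximation), and the center of minimum distance is restricted to the cluster's own points.

```python
import math

def highest_probability(cluster, prob):
    """Selector 1: the point with the highest probability in the cluster."""
    return max(cluster, key=lambda p: prob[p])

def weighted_centroid(cluster, prob):
    """Selector 2: average coordinate, with each point weighted by its probability."""
    total = sum(prob[p] for p in cluster)
    lat = sum(p[0] * prob[p] for p in cluster) / total
    lon = sum(p[1] * prob[p] for p in cluster) / total
    return (lat, lon)

def min_distance_center(cluster, dist):
    """Selector 3: the member point with the smallest summed distance to the others."""
    return min(cluster, key=lambda p: sum(dist(p, q) for q in cluster))

cluster = [(0.0, 0.0), (0.0, 2.0), (0.0, 10.0)]
prob = {(0.0, 0.0): 0.5, (0.0, 2.0): 0.25, (0.0, 10.0): 0.25}
by_prob = highest_probability(cluster, prob)
by_cent = weighted_centroid(cluster, prob)
by_dist = min_distance_center(cluster, math.dist)  # Euclidean distance for the toy data
```

The toy cluster shows that the three selectors can return three different points for the same cluster, which is why they are compared empirically.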

6.4. Implementation of Prediction Approach

We summarize the current city prediction approach in Algorithm 2. In practice, to speed up the computation of location probability vector for a given LN-user $u$, we first compute location indications from $u$’s location sensitive attributes and LN-friends:



(7) 


Assume $\mathcal{L}_{LA-F}$ is the set of current cities of $u$’s LA-friends. We then add the location indications from $u$’s LA-friends, $p_{LA-F}(u, l_i)$ (refer to Eq. 2), to $p(u, l_i)$, where $l_i \in \mathcal{L}_{LA-F}$.

7. Evaluation on Current City Prediction

In this section, we first introduce the experiment setups including the used Facebook 
data set, the compared approaches and the measurements. Then, we report the experiment 
results. 


7.1. Experiment Setup 

7.1.1. Data description 


We crawled Facebook by a Breadth First Search (BFS) approach from March to June in 2012 and collected 371,913 users’ information including profile (e.g., gender, current city, hometown) and friends. Among all these users, 153,909 users publicly report their current city (LA-users) and 225,314 users do not reveal their current city (LN-users). All these users generate 12,863 different locations. For more details about this data set, please refer to our previous work.

To evaluate the prediction approach, a user’s latest work or education experience is extracted as a location sensitive attribute, named ‘Work and Education’; we also exploit a user’s ‘Hometown’ as another location sensitive attribute. In our data set, 122,899 LA-users show ‘Hometown’, 54,097 LA-users reveal ‘Work and Education’ and 115,807 LA-users publish their friend lists.

In addition to the exploited location sensitive information, some other information (e.g., a user’s geo-tagged posts) in Facebook may also leak the location. Our prediction approach can be extended to consider other location sensitive information smoothly, which we will discuss more in Sec. 9.1.


7.1.2. Approaches 

We first compare the different location selection approaches introduced in Sec. 6.3 to finalize the prediction approach with a good location selector. We also evaluate the performance of a non-cluster prediction approach to show the effectiveness of the location cluster. Specifically, these approaches are denoted as:

• PFLIprob is a cluster based approach which selects the point of highest probability from the selected cluster as the predicted location.

• PFLIcent is a cluster based approach which selects the geographic centroid (the average coordinate for all the points in a cluster, with the probability of each point considered as its weight) from the selected cluster as the predicted location.

• PFLIdist is a cluster based approach which selects the center of minimum distance from the selected cluster as the predicted location.

• PFLInoclst is a non-cluster approach which selects the point of highest probability from all candidate locations as the predicted location.

The proposed approaches are also compared to several state-of-the-art methods:

• Basedist predicts a user’s location based on the observation that the distance between two users tends to decrease as their friendship strengthens.

• Baseann maps any location sensitive attribute value to a certain location and applies an artificial neural network to train a current city prediction model.

• Basefreq, borrowing the idea from the prior works based on the Twitter data set [5, 8], counts the frequency of locations that emerge in a user’s friends and predicts his current city by the most frequent location.

• Basefreq+ improves Basefreq by further using the neighborhood smoothing approach [8]. Given a location $l$, the points that are less than 20 km apart from $l$ are considered as $l$’s neighborhoods.

• Baseknn also relies on the frequency idea for Twitter; however, it merely counts on a user’s $k$ closest friends who have the most common friends with him to compute the most frequent location.

            Basedist   Baseann   Basefreq   Basefreq+   Baseknn   PFLInoclst   PFLIdist   PFLIcent   PFLIprob
AED@60%          8.6       5.7        5.9         4.9      10.8          2.5       49.5        5.6        2.1
AED@80%         85.0      64.3       91.8        56.0     100.0         40.1       77.4       38.0       36.9
AED@100%      1288.5    1129.0     1160.5      1123.7    1397.6        874.0      885.9      855.3      854.4

Table 1: Prediction Results (AED, in km) for Users with LA-Friends


Among the above approaches, Basedist and Baseann are originally devised for Facebook, while Basefreq, Basefreq+ and Baseknn are devised for Twitter. We utilize the main ideas from Basefreq, Basefreq+ and Baseknn, and adapt them to fit our data set. By comparing our approach to Basedist, Basefreq, Basefreq+ and Baseknn, which mainly depend on friendships, we test the effectiveness of integrating location sensitive attributes. By comparing to Baseann, we examine the newly introduced one-attribute/multiple-locations mapping method.


7.1.3. Measurement 

Two widely used measurements, Average Error Distance (AED) and Accuracy within K km (ACC@K) [5, 8, 24], are exploited.

Error Distance computes the distance in kilometers between a user $u$’s real location and predicted location, i.e., $ErrDist(u)$. AED averages the Error Distances of the overall evaluated users, denoted as $AED = \frac{1}{|U|} \sum_{u \in U} ErrDist(u)$. In addition, we rank the users by their Error Distance in ascending order and report AED of the top 60%, 80% and 100% of the evaluated users in the ranked list, denoted as AED@60%, AED@80% and AED@100% respectively [24].

Given a predefined Error Distance K km, a prediction for a user is considered correct if its Error Distance is less than K km; otherwise, the prediction is incorrect. Then, Accuracy within K km is defined as the percentage of correct predictions (i.e., the percentage of users being predicted with an Error Distance less than K km), denoted as $ACC@K = \frac{|\{u \mid u \in U \wedge ErrDist(u) < K\}|}{|U|}$. It shows the prediction capability of an approach at a specific pre-established Error Distance.
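Both measurements are easy to compute from a list of per-user Error Distances, as the following sketch shows. It assumes the ranking used for AED@60% and AED@80% puts the smallest Error Distances first, which is what the reported values suggest; the function names and the toy error list are illustrative.

```python
def aed_at(errors, fraction=1.0):
    """Average Error Distance over the best-predicted fraction of users."""
    ranked = sorted(errors)                       # smallest Error Distance first
    top = ranked[:max(1, round(len(ranked) * fraction))]
    return sum(top) / len(top)

def acc_at_k(errors, k_km):
    """Share of users whose Error Distance is strictly below K km."""
    return sum(1 for e in errors if e < k_km) / len(errors)

errors_km = [1.0, 3.0, 5.0, 50.0, 941.0]          # hypothetical Error Distances
aed60 = aed_at(errors_km, 0.6)
aed100 = aed_at(errors_km)
acc20 = acc_at_k(errors_km, 20)
```

A single very large error dominates the full average while leaving the 60% figure untouched, which is the pattern discussed for AED@100% in the results.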

            Basedist   Baseann   Basefreq   Basefreq+   Baseknn   PFLInoclst   PFLIdist   PFLIcent   PFLIprob
AED@60%        102.8       6.7       73.9        66.6     119.5          3.5       50.6        6.3        3.1
AED@80%       1368.8      74.7     1257.2      1243.1    1429.6         52.5       88.2       50.2       49.1
AED@100%      2671      1204.0     2523.5      2498      2698.5        981.0      989.9      960.8      960.0

Table 2: Prediction Results (AED, in km) for Overall Users


7.2. Experiment Results

Many relationship-based methods (e.g., Basedist, Basefreq, Basefreq+ and Baseknn) rely heavily on users’ LA-friends whose locations are exposed. In general, such methods can work well for the users who have a certain number of LA-friends; but when they are applied to the overall users (who either have or do not have LA-friends), the performance notably decreases. We evaluate the prediction performance on two user sets: users with LA-friends and overall users, and report the evaluation results on AED and ACC@K subsequently.

7.2.1. Evaluation on AED 

Table 1 and Table 2 show the AEDs of all the compared approaches for two user sets. The smallest AEDs in both tables are generated by PFLIprob.

Let us first look at the PFLI model based approaches (i.e., PFLIdist, PFLIcent, PFLIprob, and PFLInoclst). Among the first three cluster based approaches, which differ in their location selectors, PFLIdist generates the largest AEDs while PFLIprob achieves the smallest AEDs. We also compare the non-cluster approach PFLInoclst and the cluster approach PFLIprob, which both select the location of the highest probability. We observe that PFLIprob presents smaller AEDs than PFLInoclst, which verifies the effectiveness of the location cluster approach.

In addition, the results show that PFLIprob presents smaller AEDs than all the baselines at every reported level, and that all the PFLI model based approaches do so at AED@100%. In particular, the results demonstrate that the PFLI model based approaches, which map one attribute to multiple locations, reduce the AED significantly compared to Baseann, which maps one attribute to one location.

By examining the results of AED@60%, AED@80% and AED@100%, we observe that the PFLI model based approaches can predict current city with relatively small AED@60% and AED@80%, whereas AED@100% increases by 10-23 times from AED@80%. This demonstrates that large Error Distances occur only in predictions for a small number of users.

Lastly, we compare the results in the two tables and notice that the prior approaches (Basedist, Basefreq, Basefreq+ and Baseknn) predict locations with much larger AEDs for overall users than for users with LA-friends; however, for the PFLI model based approaches, AEDs differ only slightly between the two user sets. This demonstrates that a user’s profile can significantly contribute to the location prediction when the user’s friends’ locations are unavailable.

7.2.2. Evaluation on ACC@K 

We study ACC@K of the three proposed prediction approaches (PFLIprob, PFLIcent and PFLIdist) for two user sets in Figure 4. We observe that the accuracy of PFLIprob goes up steadily with the increase of Error Distance. PFLIcent may lead to very low accuracy when the pre-established Error Distance is quite small, but it can achieve higher accuracy than PFLIprob when the pre-established Error Distance is larger than 40 km. This reveals the properties of these two prediction approaches: PFLIcent, which selects the geographic centroid of a cluster, generates a short average Error Distance to all the locations in the cluster but fails to pick the user’s exact coordinate unless it coincides with the centroid; while PFLIprob may produce a large Error Distance if the location of the highest probability is not the user’s real location. In addition, PFLIdist is not competitive with the other two approaches.

Rather than solely using any one of the proposed approaches, we exploit a combined-approach strategy by flexibly selecting the best approach according to the pre-established Error Distance. Specifically, this strategy uses PFLIprob when the pre-established Error Distance is smaller than 40 km and otherwise applies PFLIcent. The combination is practical and can obtain a better performance than using any single approach. We plot the combination line in Figure 4 and call it PFLIcmb.
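The combined strategy is a simple switch on the pre-established Error Distance, sketched below. The function name `pfli_cmb`, the (lat, lon) tuple representation and the direct averaging of coordinates are assumptions for illustration; only the 40 km switching point comes from the text.

```python
def pfli_cmb(cluster, prob, k_km, switch_km=40.0):
    """Pick the location selector according to the pre-established Error Distance K."""
    if k_km < switch_km:
        # Small K: return the single most probable point (PFLIprob behaviour).
        return max(cluster, key=lambda p: prob[p])
    # Larger K: return the probability-weighted centroid (PFLIcent behaviour).
    total = sum(prob[p] for p in cluster)
    return (sum(p[0] * prob[p] for p in cluster) / total,
            sum(p[1] * prob[p] for p in cluster) / total)

cluster = [(0.0, 0.0), (0.0, 4.0)]
prob = {(0.0, 0.0): 0.75, (0.0, 4.0): 0.25}
fine = pfli_cmb(cluster, prob, k_km=20)
coarse = pfli_cmb(cluster, prob, k_km=100)
```

The switch reflects the trade-off described above: an exact point helps when K is small, while a centroid is a safer guess when K is large.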

Figure 4: ACC@K of Different Location Selectors. (a) Users with LA-Friends; (b) Overall Users.

Figure 5 compares PFLIcmb to various baseline methods in terms of ACC@K. We observe that the proposed PFLIcmb outperforms all the compared baselines for both user sets. Compared to PFLInoclst, PFLIcmb increases accuracy by around 1.5% and 1.2% on average for users with LA-friends and overall users, respectively. This supports the effectiveness of the cluster strategy with successive cluster selection and location selection.

Figure 5: ACC@K of the Proposed Approach and Other Baselines. (a) Users with LA-Friends; (b) Overall Users.

Comparing Figure 5(a) and 5(b), we observe that the approaches Basefreq, Basefreq+, Basedist and Baseknn perform much worse for overall users than for users with LA-friends. This observation again indicates that these approaches depend heavily on the friends’ locations. However, for the other approaches, which integrate location indications from both location sensitive attributes and friends (including our previous work Baseann), the prediction performance for overall users is relatively close to the performance for users with LA-friends.


8. Current City Exposure Estimator 


In this section, we pay attention to estimating current city exposure probability for a user who hides his current city. We formulate the current city exposure estimation problem as: Given (i) a graph $\mathcal{G} = (\mathcal{U}^{LA} \cup \mathcal{U}^{LN}, \mathcal{E}, \mathcal{L})$; (ii) the public location $l(u)$ for LA-users $u \in \mathcal{U}^{LA}$; (iii) the location sensitive attributes $\mathcal{A}(u)$ and the friends list $\mathcal{F}(u)$ for all the users $u \in \mathcal{U}^{LA} \cup \mathcal{U}^{LN}$; (iv) a pre-established Error Distance $K$ km, we forecast the current city exposure probability within $K$ km and report the exposure risk level for each LN-user $u \in \mathcal{U}^{LN}$.

To solve this problem, we run the proposed prediction approach on an aggregation of 
users and conduct analysis on the aggregated prediction results. Furthermore, we apply a 
regression method to construct the exposure model according to the analysis observations. 
Relying on this model, we devise a current city exposure estimator to inform users of their 
current city Exposure Probability within K km and Exposure Risk Level. 

The Exposure Probability within K km (EP@K) represents the probability that a user’s current city could be inferred correctly if the pre-established Error Distance is K km. As it is conceptually similar to the metric ACC@K, we compute it by the same formula:

$$EP@K = \frac{|\{u \mid u \in U \wedge ErrDist(u) < K\}|}{|U|} \tag{8}$$


Additionally, we set up five Exposure Risk Levels according to the value of Exposure Probability, shown in Table 3. Level 5 is defined as the most risky level, which indicates an Exposure Probability of at least 0.9, while Level 1 is the safest one, which represents a small Exposure Probability lower than 0.25.
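The mapping from Exposure Probability to Risk Level can be written as a small lookup. The sketch assumes half-open intervals with inclusive lower bounds (for instance, 0.25 falls into Level 2), which matches the levels reported in the case studies; the function name is illustrative.

```python
def risk_level(ep):
    """Map an Exposure Probability in [0, 1] to a Risk Level from 1 (safest) to 5."""
    if not 0.0 <= ep <= 1.0:
        raise ValueError("exposure probability must lie in [0, 1]")
    if ep >= 0.9:
        return 5
    if ep >= 0.75:
        return 4
    if ep >= 0.5:
        return 3
    if ep >= 0.25:
        return 2
    return 1

levels = [risk_level(p) for p in (0.967, 0.883, 0.564, 0.374, 0.059)]
```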

Next, we show some observations of inspections on the aggregated prediction results. We 
then introduce the current city exposure model and the model based estimator. Finally, we 
illustrate some case studies to show the use of our proposed exposure estimator. We also 
summarize some guidelines to reduce the exposure risk. 
Exposure Probability   [0.9, 1]   [0.75, 0.9)   [0.5, 0.75)   [0.25, 0.5)   [0, 0.25)
Risk Level             Level 5    Level 4       Level 3       Level 2       Level 1

Table 3: Risk Level vs. Exposure Probability


User’s Visible Attributes                            Abbreviation
‘Hometown’                                           ‘HT’
‘Work and Education’                                 ‘WE’
‘Friends’                                            ‘F’
‘Hometown’ and ‘Work and Education’                  ‘HT+WE’
‘Hometown’ and ‘Friends’                             ‘HT+F’
‘Work and Education’ and ‘Friends’                   ‘WE+F’
‘Hometown’, ‘Work and Education’ and ‘Friends’       ‘HT+WE+F’

Table 4: Users Categories by Visible Attributes Combination

8.1. Current City Exposure Inspection

In this subsection, we extract several measurable characteristics from users’ self-exposed information (e.g., User Category), and inspect the current city exposure probability by these characteristics.

First, we classify users into diverse categories with respect to the combinations of visible/invisible properties of their location sensitive attributes and friends list. Table 4 lists the obtained seven User Categories. User Category measures the types and amount of users’ self-exposed information.
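Deriving a User Category from a profile can be sketched as below. The profile keys (`hometown`, `work_education`, `friends`) are hypothetical field names chosen for illustration; an attribute counts as visible when its value is non-empty.

```python
def user_category(profile):
    """Build the category label (e.g. 'HT+F') from the visible parts of a profile."""
    fields = (("hometown", "HT"), ("work_education", "WE"), ("friends", "F"))
    parts = [abbr for key, abbr in fields if profile.get(key)]
    return "+".join(parts) if parts else None

cat_a = user_category({"hometown": "ZJJ", "friends": ["u3", "u4"]})
cat_b = user_category({"hometown": "ZJJ", "work_education": "Baidu", "friends": ["u3"]})
cat_c = user_category({"work_education": "Telecom SudParis", "friends": []})
```

A user who exposes none of the three items gets no category here, since such a user provides no location indication for the estimator to work with.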



Figure 6: Current City Exposure Probability by User Category. 





















Figure 6 inspects the Exposure Probabilities for various User Categories. From this figure, we observe that different types of self-exposed information may divulge users’ current city to different extents. For instance, users in the ‘WE’ category are normally at higher risk of disclosing their current city than users in the ‘HT’ or ‘F’ categories. We also find that the users who publish their ‘WE’ (in the ‘WE’, ‘HT+WE’, ‘WE+F’ or ‘HT+WE+F’ categories) exhibit a high Exposure Probability. This means that ‘WE’ is a very risky attribute that may leak users’ current city. The results also reveal that ‘HT’ is more likely to disclose current city than ‘F’, although ‘F’ is generally regarded as a significant location indication.

Figure 6 also indicates that a user’s current city generally could be predicted with a higher probability if the user exposes more information. For example, users who expose ‘HT+F’ exhibit a higher exposure probability than users only revealing either ‘HT’ or ‘F’. Note that, for a user who exposes ‘HT+WE’, his current city exposure probability can be up to 90%, which approaches the exposure probability of users who expose ‘HT+WE+F’. In other words, merely exposing ‘HT+WE’ can almost lead to the exposure of a user’s current city. To conclude, User Category, which distinguishes users by the types and amount of their self-exposed information, relates to Exposure Probability.

In addition to User Category, we study the influence of the percentage of friends with attributes (i.e., % Friends with Attributes) on Exposure Probability. % Friends with Attributes is the ratio of a user’s friends who present at least one attribute to his overall friends.
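This ratio can be computed directly from the friends list, as in the following sketch. The data layout (a dictionary of attribute values per friend, with `None` for hidden values) is an assumption, and returning 0 for a user without friends is a convention chosen here.

```python
def friends_with_attributes(friends, attrs):
    """Ratio of friends who show at least one attribute to all friends."""
    if not friends:
        return 0.0
    shown = sum(
        1 for v in friends
        if any(value is not None for value in attrs.get(v, {}).values())
    )
    return shown / len(friends)

attrs = {
    "a": {"hometown": "Paris"},
    "b": {"hometown": None, "work_education": None},
    "c": {},
}
fa = friends_with_attributes(["a", "b", "c", "d"], attrs)
```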



Figure 7: Current City Exposure Probability by the Percentage of Friends with Attributes. 

Figure 7 displays the Exposure Probability (i.e., EP, Z axis) by % Friends with Attributes (i.e., FA, X axis) at different Error Distances (i.e., ED, Y axis). As more than 95% of the users have a % Friends with Attributes smaller than 45%, we only look at its value in a range of 0% to 45%. Generally speaking, Exposure Probability grows with the increase of % Friends with Attributes.





Figure 8: Exposure Probability by Cluster Confidence in Different User Categories. Panels include (a) ‘HT’, (b) ‘WE’, (e) ‘HT+F’, (f) ‘WE+F’ and (g) ‘HT+WE+F’.






In addition, we define a new metric named Cluster Confidence. It estimates the ratio of the probabilities of candidate locations in the selected cluster $c_h$ to the overall probabilities of all the candidate locations (equal to 1), calculated as follows:

$$CC(u) = \sum_{l_i \in c_h} p(u, l_i) \tag{9}$$

Cluster Confidence represents the confidence of the users’ location indications. For example, Cluster Confidence with a value of 100% means that all of a user’s location indications point to an exclusive location cluster. We further look into the change of Exposure Probability according to Cluster Confidence for each User Category.
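Cluster Confidence can be computed from the location probability vector and the selected cluster, as sketched below. The sketch divides by the total probability explicitly so that it also works when the vector is not normalized; the names are illustrative.

```python
def cluster_confidence(selected_cluster, prob):
    """Share of the total location probability that falls inside the selected cluster."""
    total = sum(prob.values())
    return sum(prob.get(loc, 0.0) for loc in selected_cluster) / total

prob = {"Paris": 0.5, "Evry": 0.25, "Beijing": 0.25}
cc = cluster_confidence(["Paris", "Evry"], prob)
```

A value of 1.0 would mean that every location indication of the user points into the selected cluster.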

Figure 8 reveals how Exposure Probability (i.e., EP, Z axis) varies with diverse Cluster Confidence (i.e., CC, X axis) and Error Distances (i.e., ED, Y axis) in different User Categories. The results show that the Exposure Probability normally grows when the Cluster Confidence gets larger. When the Cluster Confidence equals 100%, the Exposure Probability surpasses 90% within a pre-established Error Distance of 20 km for almost all User Categories. This observation indicates that the current city is at higher risk of being predicted when a user’s location indications are more likely to point to one city or to multiple cities that are in the same cluster. In other words, a user’s current city can be easily disclosed if the confidence of the user’s self-exposed information is high.

Note that there exists an exception for the users only exposing their ‘F’: the Exposure Probability declines when the Cluster Confidence is larger than 0.9. One reasonable explanation is that only the users with an extremely small number of friends (e.g., only one friend) can have a Cluster Confidence higher than 0.9, which might reduce the exposure risk of current city due to the limited information.

8.2. Estimating Current City Exposure Risk 
8.2.1. Current City Exposure Model 

In the previous section, we observe that a user’s current city Exposure Probability is probably influenced by four factors: Error Distance, User Category, % Friends with Attributes and Cluster Confidence. Taking these four factors as features, we respectively use Random Decision Forest and Linear Regression approaches to model Exposure Probability. The performance of the model is evaluated by two commonly used metrics, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), with 10-fold cross validation, shown in Table 5. We observe that the Random Decision Forest based model outperforms the Linear Regression based model by presenting smaller MAE and RMSE. Therefore, we employ the Random Decision Forest based model to estimate current city exposure probability, denoted as RDF Exposure Model.

        Random Decision Forest   Linear Regression
MAE     0.027                    0.061
RMSE    0.077                    0.146

Table 5: Performance Comparison of Exposure Models

Furthermore, the ‘Leave-one-feature-out’ approach is exploited to verify the effectiveness of the features. We use the Random Decision Forest approach to train exposure models by taking out any one of the four features, namely No Error Distance, No User Category, No % Friends with Attributes and No Cluster Confidence. Table 6 compares these ‘Leave-one-feature-out’ models to the RDF Exposure Model. We observe that the RDF Exposure Model presents the best performance with the smallest MAE and RMSE. The performance degradations when removing any one of the features verify that all the four studied features contribute to the model. Cluster Confidence is observed as the most sensitive feature for the model, because the performance of the RDF Exposure Model drops most significantly when Cluster Confidence is taken out.

        RDF Exposure Model   No Error Distance   No User Category   No % Friends with Attributes   No Cluster Confidence
MAE     0.027                0.052               0.065              0.045                          0.082
RMSE    0.077                0.106               0.131              0.117                          0.166

Table 6: Feature Verification of RDF Exposure Model

8.2.2. Current City Exposure Estimator 

By exploiting the proposed current city exposure model, we construct an exposure estimator to forecast the exposure risk of a user’s private current city. Figure 9 illustrates the framework of the current city exposure estimator. The exposure estimator contains three main function modules: user information handler, current city exposure model and exposure risk level decision. The inputs of the exposure estimator include a user’s self-exposed information and a pre-established Error Distance. Given a user’s self-exposed information, the user information handler determines User Category, and computes Cluster Confidence and % Friends with Attributes. Based on the pre-established Error Distance, the obtained User Category, Cluster Confidence, and % Friends with Attributes, the exposure model calculates the current city exposure probability for the user. The exposure risk module determines a risk level according to the exposure probability. Finally, the exposure estimator outputs two risk measurements of current city: Exposure Probability and Risk Level.
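The flow through the estimator can be sketched as a thin wrapper around a trained exposure model. The trained model itself is not reproduced here: `model` is a hypothetical callable standing in for it, and the stub below simply returns a fixed probability so that the wiring of the modules can be shown.

```python
def estimate_exposure(user_category, cluster_confidence, pct_friends_attr,
                      error_km, model):
    """Return (Exposure Probability, Risk Level) for one user.

    model: hypothetical callable (error_km, user_category, pct_friends_attr,
           cluster_confidence) -> probability; it stands in for a trained
           regression model and is an assumption of this sketch.
    """
    ep = model(error_km, user_category, pct_friends_attr, cluster_confidence)
    ep = min(1.0, max(0.0, ep))       # keep the estimate a valid probability
    # Risk level decision with the thresholds 0.9, 0.75, 0.5 and 0.25.
    if ep >= 0.9:
        level = 5
    elif ep >= 0.75:
        level = 4
    elif ep >= 0.5:
        level = 3
    elif ep >= 0.25:
        level = 2
    else:
        level = 1
    return ep, level

stub_model = lambda error_km, category, pct, cc: 0.834   # placeholder, not a trained model
result = estimate_exposure("WE", 0.404, 0.0, 20, stub_model)
```

Because the model is passed in as a parameter, the same wrapper can be rerun with a different User Category to see how hiding an attribute would change the risk.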

Figure 9: Framework of Current City Exposure Estimator.


8.3. Case Studies: Exposure Estimator and Privacy Protection 


User   User Category   Cluster Confidence   Error Distance   % Friends with Attributes   Exposure Probability   Risk Level
u1     ‘HT+WE+F’       0.69                 100 km           0.9%                        0.967                  Level 5
u1     ‘HT+WE+F’       0.69                 20 km            0.9%                        0.883                  Level 4
u2     ‘F’             0.208                100 km           11.2%                       0.564                  Level 3
u3     ‘F’             0.208                100 km           0.2%                        0.374                  Level 2
u4     ‘WE+F’          0.281                100 km           2.1%                        0.407                  Level 2
u5     ‘WE+F’          0.57                 100 km           2.1%                        0.797                  Level 4
u6     ‘HT+F’          0.332                20 km            20.1%                       0.276                  Level 2
u7     ‘HT+WE’         0.73                 100 km           0%                          0.903                  Level 5
u8     ‘HT’            0.169                20 km            0%                          0.059                  Level 1
u9     ‘WE’            0.404                20 km            0%                          0.834                  Level 4
u10    ‘F’             0.891                20 km            17.2%                       0.823                  Level 4

Table 7: Exposure Estimator Case Studies


Any LN-user who reveals some self-exposed information and pre-defines an Error Distance
can use the proposed current city exposure estimator to assess their Exposure Probability
and Risk Level. To illustrate the use of the exposure estimator, we present several
use cases in Table 7. In this study, we observe that some LN-users are not really
safe in hiding their current city if they leave some other information visible. For instance,
considering U9, even though only ‘WE’ is published, his current city is almost leaked, with
a high Exposure Probability of 0.834 within an Error Distance of 20 km. In
addition, for users in the same User Category, the one with a higher Cluster Confidence is
more likely to divulge his current city. Looking at U4 and U5, who are both in the ‘WE+F’
category, the current city of U5, who exhibits a higher Cluster Confidence (0.57 versus 0.281),
is more likely to be inferred than U4’s current city (Exposure Probability 0.797 versus 0.407).

U1                    Current status  Hide
                      ‘HT+WE+F’       ‘WE’     ‘F’      ‘HT’     ‘WE+F’   ‘HT+WE’
Exposure Probability  0.967           0.503    0.944    0.936    0.456    0.073
Risk Level            Level 5         Level 3  Level 5  Level 5  Level 2  Level 1

Table 8: Exposure Guidelines for U1: the exposure risks if he adjusts some privacy configurations
with an Error Distance of 100 km
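The estimator framework of Figure 9 can be sketched in a few lines of Python. This is only an illustrative sketch, not our implementation: the names `estimate_exposure`, `handler`, `model` and `CUTOFFS` are hypothetical, the handler and the exposure model are supplied by the caller as stand-ins, and the cut-off values are assumptions chosen solely because they reproduce the Risk Level entries tabulated in this section. The boundaries between Levels 1 and 2 and between Levels 3 and 4 are not pinned down by the tabulated cases.

```python
from bisect import bisect_right

# HYPOTHETICAL cut-offs between Risk Levels 1..5. They are assumptions picked
# to agree with the tabulated cases, not thresholds stated in the text.
CUTOFFS = (0.1, 0.5, 0.7, 0.9)


def risk_level(probability):
    """Map an Exposure Probability in [0, 1] to a Risk Level from 1 to 5."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # Number of cut-offs at or below the probability, shifted to start at 1.
    return bisect_right(CUTOFFS, probability) + 1


def estimate_exposure(user_info, error_distance_km, handler, model):
    """Chain the three modules of the estimator framework.

    handler(user_info) -> (user_category, cluster_confidence, pct_friends)
    model(user_category, cluster_confidence, error_distance_km, pct_friends)
        -> exposure probability
    Both callables are stand-ins for the user information handler and the
    trained exposure model.
    """
    category, cluster_confidence, pct_friends = handler(user_info)
    probability = model(category, cluster_confidence, error_distance_km, pct_friends)
    return probability, risk_level(probability)
```

Separating the risk level decision from the exposure model keeps the cut-offs adjustable without retraining the model.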

In addition, the exposure estimator can suggest countermeasures in the privacy configuration
against information leakage. Assuming that a user hides some part of their exposed information,
the exposure estimator estimates and reports the corresponding Exposure Probability and
Risk Level. The user can then decide on a new privacy configuration accordingly. We
take U1 as an example and list the exposure risks that would result if he adjusted his
privacy configuration. The results shown in Table 8 reveal that the exposure risk could be
significantly decreased if U1 hides his ‘HT+WE’, ‘WE+F’ or ‘WE’. The results also point
out that merely hiding ‘F’ or ‘HT’ cannot protect U1’s current city privacy.
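The what-if analysis above amounts to enumerating the hiding options and re-running the estimator on what remains visible. The following Python sketch illustrates this under stated assumptions: `what_if` and `category` are hypothetical helper names, and the estimator is replaced by a plain lookup of the Table 8 values for U1, so features such as Cluster Confidence are not recomputed here as a real estimator would do.

```python
from itertools import combinations

# Attribute order used in the User Category labels of Tables 7 and 8.
ATTRS = ("HT", "WE", "F")


def category(attrs):
    """Build a User Category label such as 'HT+F' from a set of attributes."""
    return "+".join(a for a in ATTRS if a in attrs)


def what_if(exposed, estimate):
    """Rank hiding options by the resulting Exposure Probability, lowest first.

    `estimate` is a hypothetical callable that maps the label of the
    attributes left visible to an Exposure Probability, or returns None
    when that configuration cannot be assessed.
    """
    exposed = [a for a in ATTRS if a in exposed]
    rows = []
    for r in range(1, len(exposed)):  # hide 1 .. n-1 of the exposed attributes
        for hidden in combinations(exposed, r):
            visible = [a for a in exposed if a not in hidden]
            p = estimate(category(visible))
            if p is not None:
                rows.append((category(hidden), category(visible), p))
    return sorted(rows, key=lambda row: row[2])


# Stand-in for the estimator: Table 8 values for U1 (Error Distance 100 km),
# keyed by the attributes that remain visible after hiding.
TABLE8 = {"HT+F": 0.503, "HT+WE": 0.944, "WE+F": 0.936, "HT": 0.456, "F": 0.073}

ranking = what_if({"HT", "WE", "F"}, TABLE8.get)
for hidden, visible, p in ranking:
    print(f"hide {hidden:<6} -> visible {visible:<6} exposure {p:.3f}")
```

Sorting by Exposure Probability puts the most protective option first, which for U1 is hiding ‘HT+WE’.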

Finally, according to the studies on current city exposure risk, we summarize the following 
general suggestions: 

• As any location indication may expose the hidden current city, hide all location-sensitive
information, including ‘WE’, ‘F’ and ‘HT’, to achieve high current city
security.

• When sharing some personal information publicly (e.g., ‘F’), hide the most sensitive
exposed information (e.g., ‘WE’), since the most sensitive information can
independently lead to a quite high Exposure Probability. For example, ‘WE’ alone can
lead to an Exposure Probability higher than 80% (see U9 in Table 7).

• According to the centrality principle, which relates to the Cluster Confidence, hide ‘F’ if
most friends indicate the same place where the user lives. For instance, U10 in Table 7
is strongly advised to hide his ‘F’.


9. Discussion and Future Work

In this section, we discuss some issues that are not addressed in this work due to space
limitations, and point out some potential future research directions.

9.1. Extensibility of the Current City Prediction Approach 

Due to the limitations of our data set, we use only three features (i.e., ‘Hometown’, ‘Work and
Education’ and ‘Friend’) to evaluate our proposed current city prediction approach. However,
our prediction approach can be extended to consider other location-sensitive attributes.
For instance, for the location-sensitive pages that a user follows (e.g., the page of a favorite
local restaurant) or the location-sensitive posts that a user has published (e.g., geo-tagged posts),
we can regard each page or post as an LA-Friend and apply the LA-FLI model to explore
the location indications.


9.2. Adaptability of the Exposure Estimation Approach 

In addition, our exposure estimation approach can easily be adapted to other current city
prediction approaches through a two-step solution: (1) feature extraction (Sec. 8.1) and (2)
exposure model training (Sec. 8.2). In particular, we can first extract, for other
city prediction approaches, features similar to those inspected in Sec. 8.1. Take Cluster Confidence as
an example. For cluster-based city prediction approaches like ours, Cluster Confidence
can be extracted in the same way, i.e., as the largest cluster prediction probability. For
other city prediction approaches without a clustering step [3] [2^, following the essence of
Cluster Confidence, a similar feature, Prediction Confidence, can be computed as the largest
city prediction probability. Likewise, we can also obtain the other features presented in our
exposure model for many other city prediction approaches; we do not discuss them
further for brevity. Once the features are derived, in the second step, we can directly apply
the regression methods used in Sec. 8.2 to train the exposure models for other prediction
approaches.
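As a small illustration of the feature extraction step for a predictor without a clustering step, Prediction Confidence can be read directly from the predictor's output distribution. The sketch below assumes the predictor returns a mapping from candidate cities to probabilities. The function name `prediction_confidence` and the example cities and values are hypothetical.

```python
def prediction_confidence(city_probs):
    """Return the largest predicted probability among the candidate cities.

    `city_probs` maps each candidate city to the probability assigned by
    some city prediction approach (an assumed output format).
    """
    if not city_probs:
        raise ValueError("at least one candidate city is required")
    return max(city_probs.values())


# Illustrative values only, not taken from our data set.
example = {"Paris": 0.6, "Lyon": 0.3, "Nice": 0.1}
confidence = prediction_confidence(example)
```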

9.3. Generalizability of the Exposure Estimator

Taking ‘current city’ as a representative attribute to study the information exposure issue,
this work gives further insights into how to assess the exposure risk of other privacy-sensitive
attributes (e.g., age). Denoting the privacy-sensitive attribute as PSA, the process of assessing
its exposure risk can be generalized into three steps: (1) explore PSA-sensitive attributes
and construct a PSA prediction model; (2) inspect the prediction results to extract features
and train a PSA exposure model; (3) based on the exposure model, implement an exposure
estimator to notify users of the exposure risk and provide suggestions to lower the risk if
necessary.

Moreover, our future work will consider integrating multiple exposure models into the 
exposure estimator, so as to construct an exposure estimation system that can provide 
reliable and multi-functional exposure risk estimations. 

10. Conclusion 

This paper starts with two open questions regarding the security of users’ hidden
privacy-sensitive attributes. To answer these questions, we first propose a novel current city
prediction approach that infers users’ current city by leveraging their self-exposed information,
including location-sensitive attributes and friends lists. We validate the new prediction
approach on a Facebook data set containing 371,913 users, and the results reveal that
users’ hidden current city is at real risk of being predicted. We then apply the proposed
prediction approach to predict users’ current city and model the exposure probability by
considering four measurable characteristics: Cluster Confidence, Error Distance, User
Category and Percentage of Friends with Attributes. Based on the exposure model, we
propose a current city exposure estimator to measure the exposure probability and risk level
of users’ hidden current city according to their self-exposed information. The exposure
estimator can also help users adjust their privacy configuration to satisfy their privacy
requirements. While this work studies the potential risk to users’ privacy-sensitive attributes
using the representative attribute of current city in Facebook, the proposed idea and approach
could be extended to other attributes and utilized by other OSNs.